7 March 2016

Style of this talk

  • Task-based learning.
  • Quick start + References.
  • Plan: Skim through some examples and work closely on two of them.

What you'll be able to do after this talk

Preliminaries

Data visualisation. Why?

  • Exploratory analysis

    • Get a sense of what data looks like
    • Help you choose (parametric) models
    • Help you engineer features
  • Confirmatory analysis

    • Model fit evaluation
    • Residual analysis

Examples - choose parametric models

Data: 2167 fire loss records from Copenhagen Reinsurance, 1980 to 1990.

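library(MASS)  # provides fitdistr()
# Assumes `danish` is already loaded, e.g. via data(danish, package = "evir")
model <- fitdistr(danish, 'lognormal')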
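
A fitted model only helps you choose a distribution once you look at it against the data. The sketch below is not the talk's original plot: it uses simulated lognormal data as a stand-in for the loss records (the sample size and parameters are assumptions) and overlays the fitted density on a histogram.

```r
library(MASS)  # provides fitdistr()

set.seed(1)
# Simulated stand-in for the loss data (assumption: not the Danish data itself)
losses <- rlnorm(2000, meanlog = 0.8, sdlog = 0.7)

fit <- fitdistr(losses, "lognormal")

# Histogram on the density scale with the fitted lognormal curve on top
hist(losses, breaks = 60, freq = FALSE,
     main = "Simulated losses with fitted lognormal", xlab = "loss")
curve(dlnorm(x, meanlog = fit$estimate["meanlog"], sdlog = fit$estimate["sdlog"]),
      add = TRUE, col = "red", lwd = 2)
```

If the curve tracks the histogram closely, including the right tail, the parametric family is a plausible candidate.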

Examples - feature engineering

Data: death counts, ~100 features, a few thousand data points.

Examples - Model fit evaluation

lognormal_model <- fitdistr(danish, 'lognormal')
gamma_model <- fitdistr(danish, 'gamma')
print(c(lognormal_model$loglik, gamma_model$loglik))
## [1] -4433.891 -5243.027

Raw log-likelihood values are hard to interpret on their own!
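
One way to make such a comparison visual is a pair of Q-Q plots, one per candidate model. This is a minimal sketch under assumptions: the data are simulated (lognormal, as a stand-in for the loss records), and the variable names are illustrative rather than taken from the talk.

```r
library(MASS)  # provides fitdistr()

set.seed(2)
losses <- rlnorm(2000, meanlog = 0.8, sdlog = 0.7)  # simulated stand-in data

lognormal_fit <- fitdistr(losses, "lognormal")
gamma_fit     <- fitdistr(losses, "gamma")

# Theoretical quantiles at the plotting positions of the sorted data
p <- ppoints(length(losses))
q_lognormal <- qlnorm(p, meanlog = lognormal_fit$estimate["meanlog"],
                      sdlog = lognormal_fit$estimate["sdlog"])
q_gamma <- qgamma(p, shape = gamma_fit$estimate["shape"],
                  rate = gamma_fit$estimate["rate"])

# Side-by-side Q-Q plots; points near the red line indicate a good fit
par(mfrow = c(1, 2))
plot(q_lognormal, sort(losses), main = "Lognormal Q-Q",
     xlab = "theoretical quantiles", ylab = "observed quantiles")
abline(0, 1, col = "red")
plot(q_gamma, sort(losses), main = "Gamma Q-Q",
     xlab = "theoretical quantiles", ylab = "observed quantiles")
abline(0, 1, col = "red")
par(mfrow = c(1, 1))
```

A model whose points bend away from the diagonal in the upper tail is underestimating or overestimating large losses, which a single log-likelihood number does not show.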

Examples - Residual analysis

Here is the output of a linear regression model.

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.202860 -0.140567 -0.003021  0.141850  0.201984 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.003066   0.016315   0.188    0.851    
## x           0.998149   0.009412 106.054   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1419 on 299 degrees of freedom
## Multiple R-squared:  0.9741, Adjusted R-squared:  0.974 
## F-statistic: 1.125e+04 on 1 and 299 DF,  p-value: < 2.2e-16

# lm_model is the fit summarised above, i.e. lm_model <- lm(y ~ x)
plot(x, y, type = 'l')
lines(x, fitted(lm_model), col = 'red', lwd = 2)

res <- residuals(lm_model)  # same as y - fitted(lm_model)
plot(x, res, type = 'l', ylim = c(-0.4, 0.4))

The residual plot makes it fairly clear that the linear model has missed some structure.
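
The data behind this example are not shown, so here is a self-contained sketch of the same idea. The data-generating process (a straight line plus a sine term) is an assumption chosen to give a similar picture; it is not the talk's actual data.

```r
# Assumed data: a straight line plus a periodic term that lm() cannot capture
x <- seq(0, 3, length.out = 301)
y <- x + 0.2 * sin(10 * x)

lm_model <- lm(y ~ x)
res <- residuals(lm_model)

par(mfrow = c(1, 2))
# Left: data with the fitted line; the fit looks excellent at this scale
plot(x, y, type = "l", main = "Data and linear fit")
lines(x, fitted(lm_model), col = "red", lwd = 2)
# Right: residuals against x; the leftover periodic pattern is obvious
plot(x, res, type = "l", ylim = c(-0.4, 0.4), main = "Residuals")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))
```

The summary table would report a slope close to 1 and a high R-squared here too, while the residual plot shows the systematic pattern immediately.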

R commands

?COMMAND_NAME, e.g.

?max

install.packages("PACKAGE_NAME"), e.g.

install.packages("d3heatmap")
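
To avoid reinstalling packages you already have, you can check first. The helper `missing_packages` below is hypothetical (it is not part of base R or of the talk), and the package list is simply the set of packages named in this talk.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

talk_packages <- c("tabplot", "corrplot", "ggplot2", "leaflet",
                   "DT", "d3heatmap", "plotly")
to_install <- missing_packages(talk_packages)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```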

Influential R packages

"2013 - current - future" regime:

  • magrittr, tidyr, dplyr

"2016 - future" regime :

  • broom, purrr
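
These packages share a pipeline style in which each step feeds the next through `%>%`. A minimal sketch using the built-in `mtcars` data; the threshold and column names in the summary are illustrative choices, not from the talk.

```r
library(magrittr)  # provides the %>% pipeline operator
library(dplyr)     # provides filter(), group_by(), summarise()

# Read top to bottom: take the data, keep some rows, group, then summarise
cyl_summary <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

print(cyl_summary)
```

The same computation without pipes would nest the calls inside each other, which reads inside out.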

Good data structure

Static graphics

Static graphics tasks preview

  • Task 1. What do my data look like? The 'tabplot' package.
  • Task 2. Are my predictor variables correlated? The 'corrplot' package.
  • Task 3. I need professional graphics for publishing purposes. The 'ggplot2' package.

Task 1. What do my data look like?

The 'tabplot' package.

library(ggplot2)
data(diamonds)

library(tabplot)
tableplot(diamonds)

Task 2. Are my predictor variables correlated?

The 'corrplot' package.

load('input_output//cleaned_data', verbose = TRUE)  # loads the data frame data1

library(corrplot)
correlation_matrix <- cor(data1[,10:30]) 
corrplot(correlation_matrix, order = "hclust", addrect = 8)

## Loading objects:
##   data1

Task 3. I need professional graphics for publishing purposes.

The 'ggplot2' package.
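
As a quick start, a ggplot2 figure is built by adding layers to a base plot. The sketch below uses the built-in `mtcars` data; the variables, labels and theme are illustrative choices and this is not the figure shown in the talk.

```r
library(ggplot2)

# Data and aesthetic mappings first, then layers added with `+`
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                     # one point per car
  geom_smooth(method = "lm", se = FALSE) +   # one linear trend per cylinder group
  labs(title = "Fuel economy against weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +
  theme_minimal()

print(p)

# Uncomment to save a publication-sized copy:
# ggsave("mpg_vs_weight.pdf", p, width = 6, height = 4)
```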

Professional plots - some examples

ggplot2 - Graphics in 'The Economist'

We are going to create a plot similar to this:

Interactive graphics

Do we really need interactive graphics?

R packages

Interactive graphics tasks preview

  • Task 1. Where is this datapoint? The 'leaflet' package.
  • Task 2. Show me my data without overcrowding the screen. The 'DT' package.
  • Task 3. I'd like to visualise this matrix. The 'd3heatmap' package.
  • Task 4. I need interactivity for everything! The 'plotly' package.

Task 1. Where is this datapoint?

The 'leaflet' package.

Two things you need to construct a basic map:

  • Geographical coordinates (and potentially the corresponding data)
  • Map style

load('input_output//leaflet_data')
head(map_data, 5)
##       station_name latitude longitude station pressure  time trend_coeff
## 1     PUNTA ARENAS   -53.00    -70.85   85934     2000  noon      -0.276
## 2         MARAMBIO   -64.23    -56.72   89055     2000  noon      -0.060
## 3            SYOWA   -69.00     39.58   89532     2000  noon      -0.666
## 4   AMUNDSEN-SCOTT   -90.00      0.00   89009     2000 night      -0.191
## 5 NOVOLAZARAVSKAJA   -70.77     11.83   89512     2000 night      -0.648

Leaflet - Minimal example

library(magrittr)  #for the %>% pipeline operator
library(leaflet)
leaflet(data = map_data) %>% addTiles() %>%
  addMarkers(~longitude, ~latitude, popup = ~as.character(station_name))

Leaflet - Other easy options

  • Can use addProviderTiles(XXX) for other styles of map
  • Some options for XXX are "Stamen.Toner", "Acetate.terrain", "CartoDB.Positron"
  • Reference: https://rstudio.github.io/leaflet/

leaflet(data = map_data) %>% addProviderTiles("Stamen.Toner") %>%
  addMarkers(~longitude, ~latitude, popup = ~as.character(station_name))
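
Markers can also carry data. The sketch below rebuilds the five rows printed earlier as a small data frame (so it runs without the `leaflet_data` file) and colours circle markers by `trend_coeff`. The palette choice and the name `stations` are assumptions for illustration.

```r
library(leaflet)  # also makes %>% available

# The five stations shown in head(map_data, 5)
stations <- data.frame(
  station_name = c("PUNTA ARENAS", "MARAMBIO", "SYOWA",
                   "AMUNDSEN-SCOTT", "NOVOLAZARAVSKAJA"),
  latitude     = c(-53.00, -64.23, -69.00, -90.00, -70.77),
  longitude    = c(-70.85, -56.72, 39.58, 0.00, 11.83),
  trend_coeff  = c(-0.276, -0.060, -0.666, -0.191, -0.648)
)

# Map trend_coeff values onto a colour scale
pal <- colorNumeric(palette = "RdYlBu", domain = stations$trend_coeff)

m <- leaflet(data = stations) %>%
  addTiles() %>%
  addCircleMarkers(~longitude, ~latitude, color = ~pal(trend_coeff),
                   popup = ~paste0(station_name, ": ", trend_coeff)) %>%
  addLegend(pal = pal, values = ~trend_coeff, title = "trend_coeff")
m
```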

Task 2. Show me my data without overcrowding the screen

The 'DT' package.

data(iris)
DT::datatable(iris)

DT - Other easy options

library(DT)
datatable(iris, options = list(pageLength = 6)) %>%
   formatStyle('Sepal.Width',
               backgroundColor = styleInterval(3, c('orange', 'white')))

Reference: https://rstudio.github.io/DT/

Task 3. I'd like to visualise this matrix.

The 'd3heatmap' package.

Usage: explore data

data(mtcars)  #built-in dataset in R
library(d3heatmap)
d3heatmap(mtcars, scale = "column", colors = "Spectral")

Usage: explore correlation between predictors

load('input_output//cleaned_data', verbose = TRUE)  # loads the data frame data1
library(d3heatmap)
correlation_matrix <- cor(data1[,2:50])
d3heatmap(correlation_matrix, dendrogram = 'none')

## Loading objects:
##   data1

Task 4. I need interactivity for everything!

The 'plotly' package.

Plotly basics

Five components:

  1. plot_ly
  2. add_trace
  3. layout
  4. annotation_text
  5. other options: how to read the documentation
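
The first few components can be sketched together as follows. This uses the formula syntax (`~column`) of current plotly releases, which may differ from the version used when this talk was given; the annotation is supplied through `layout(annotations = ...)`, which is one way to add text to a plot. The data and labels are illustrative.

```r
library(plotly)  # also makes %>% available

# Built-in data plus a simple linear fit, sorted so the line draws left to right
fit <- lm(mpg ~ wt, data = mtcars)
plot_data <- data.frame(wt = mtcars$wt, mpg = mtcars$mpg, fitted = fitted(fit))
plot_data <- plot_data[order(plot_data$wt), ]
heaviest <- plot_data[which.max(plot_data$wt), ]

p <- plot_ly(plot_data, x = ~wt, y = ~mpg,
             type = "scatter", mode = "markers", name = "observed") %>%  # 1. plot_ly
  add_trace(y = ~fitted, mode = "lines", name = "linear fit") %>%         # 2. add_trace
  layout(title = "Fuel economy against weight",                           # 3. layout
         xaxis = list(title = "Weight (1000 lbs)"),
         yaxis = list(title = "Miles per gallon"),
         annotations = list(x = heaviest$wt, y = heaviest$mpg,            # 4. annotation
                            text = "heaviest car", showarrow = TRUE))
p
```

Hovering, zooming and toggling traces from the legend come for free once the object is printed.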

The End

Next week: Shiny and Tableau